You’ll see someone always has a happy face and someone keeps upset. But why? Curious about why people feel happy, it’s interesting to investigate the happy moment of people who have different countries, genders, ages, marriages and so on. To learn more, a research about happy moment started and analysis follows.
hm_data <- hm_data %>%
inner_join(demo_data, by = "wid") %>%
select(wid,
original_hm,
gender,
marital,
parenthood,
reflection_period,
age,
country,
ground_truth_category,
text) %>%
mutate(count = sapply(hm_data$text, wordcount)) %>%
mutate(word.count = sapply(hm_data$original_hm, wordcount)) %>%
filter(gender %in% c("m", "f")) %>%
filter(marital %in% c("single", "married")) %>%
filter(parenthood %in% c("n", "y")) %>%
filter(reflection_period %in% c("24h", "3m")) %>%
mutate(reflection_period = fct_recode(reflection_period,
months_3 = "3m", hours_24 = "24h"))
head(hm_data,5)
## # A tibble: 5 x 12
## wid original_hm gender marital parenthood reflection_peri~ age
## <int> <chr> <chr> <chr> <chr> <fct> <chr>
## 1 2053 I went on ~ m single n hours_24 35
## 2 2 I was happ~ m married y hours_24 29.0
## 3 1936 I went to ~ f married y hours_24 30
## 4 206 We had a s~ f married n hours_24 28
## 5 45 I meditate~ m single n hours_24 23
## # ... with 5 more variables: country <chr>, ground_truth_category <chr>,
## # text <chr>, count <int>, word.count <int>
suppressWarnings(describe.by(hm_data))
## vars n mean sd median trimmed mad
## wid 1 94574 2680.07 3487.41 1097 1925.46 1306.17
## original_hm* 2 94574 NaN NA NA NaN NA
## gender* 3 94574 NaN NA NA NaN NA
## marital* 4 94574 NaN NA NA NaN NA
## parenthood* 5 94574 NaN NA NA NaN NA
## reflection_period* 6 94574 1.50 0.50 2 1.50 0.00
## age* 7 94541 31.96 10.72 30 30.50 7.41
## country* 8 94427 NaN NA NA NaN NA
## ground_truth_category* 9 13379 NaN NA NA NaN NA
## text* 10 94574 NaN NA NA NaN NA
## count 11 94574 6.16 7.65 5 5.09 2.97
## word.count 12 94574 18.31 21.63 14 14.97 8.90
## min max range skew kurtosis se
## wid 1 13839 13838 1.68 1.75 11.34
## original_hm* Inf -Inf -Inf NA NA NA
## gender* Inf -Inf -Inf NA NA NA
## marital* Inf -Inf -Inf NA NA NA
## parenthood* Inf -Inf -Inf NA NA NA
## reflection_period* 1 2 1 -0.02 -2.00 0.00
## age* 2 233 231 5.14 77.75 0.03
## country* Inf -Inf -Inf NA NA NA
## ground_truth_category* Inf -Inf -Inf NA NA NA
## text* Inf -Inf -Inf NA NA NA
## count 1 509 508 20.56 907.01 0.02
## word.count 2 1155 1153 14.84 493.48 0.07
The data seems good. It has more than 100 thousand observations and 12 variables. The original text “original_hm” has been cleaned and the key words has been put into “text”.
Let’s start with the sentence length. How many words did people use to describe their happy moment?
sentence.length <- hm_data %>%
group_by(gender) %>%
count(word.count, sort = TRUE) %>%
left_join(hm_data %>%
group_by(gender) %>%
summarise(total = n())) %>%
mutate(freq = n/total)
## Joining, by = "gender"
head(sentence.length,5)
## # A tibble: 5 x 5
## # Groups: gender [1]
## gender word.count n total freq
## <chr> <int> <int> <int> <dbl>
## 1 m 8 3430 55840 0.0614
## 2 m 10 3286 55840 0.0588
## 3 m 11 3258 55840 0.0583
## 4 m 9 3256 55840 0.0583
## 5 m 7 3076 55840 0.0551
It seems people usually use 8 to 13 words to record their happy moment. Female would like to speak more about the happy moment, maybe they have more feelings to express.
ggplot(sentence.length, aes(x = word.count, y= freq, group = factor(1))) +
geom_bar(stat = "identity", color = "cornflowerblue")+
ylim(0,0.07) +
xlim(0,100) +
theme_bw() +
facet_wrap(~ gender, ncol = 2) +
labs(title = "Sentence Length for Male and Female", x = "Word Numbers", y = "Frequency")
## Warning: Removed 246 rows containing missing values (position_stack).
head(sentence.length$word.count[sentence.length$gender == "f"],5)
## [1] 11 12 10 9 13
head(sentence.length$word.count[sentence.length$gender == "m"],5)
## [1] 8 10 11 9 7
data(stop_words)
bag_of_words <- hm_data %>%
unnest_tokens(word, text)
head(bag_of_words,5)
## # A tibble: 5 x 12
## wid original_hm gender marital parenthood reflection_peri~ age
## <int> <chr> <chr> <chr> <chr> <fct> <chr>
## 1 2053 I went on ~ m single n hours_24 35
## 2 2053 I went on ~ m single n hours_24 35
## 3 2053 I went on ~ m single n hours_24 35
## 4 2053 I went on ~ m single n hours_24 35
## 5 2 I was happ~ m married y hours_24 29.0
## # ... with 5 more variables: country <chr>, ground_truth_category <chr>,
## # count <int>, word.count <int>, word <chr>
Overall, time, friend and day are of the highest frequency. Then, home, family, game and others make people feel happy.
frequency_word <- bag_of_words %>%
count(word)
word_cloud <- data.frame(word = frequency_word$word, freq = frequency_word$n)
set.seed(1)
suppressWarnings(wordcloud(words = word_cloud$word, freq = word_cloud$freq, min.freq = 1,max.words=200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2")))
We continue to discuss the difference of happy moment of male and female.
frequency_gender <- bag_of_words %>%
group_by(gender) %>%
count(word, sort = TRUE) %>%
left_join(bag_of_words %>%
group_by(gender) %>%
summarise(total = n())) %>%
mutate(freq = n/total)
## Joining, by = "gender"
head(frequency_gender,5)
## # A tibble: 5 x 5
## # Groups: gender [2]
## gender word n total freq
## <chr> <chr> <int> <int> <dbl>
## 1 m friend 6366 340130 0.0187
## 2 m day 5267 340130 0.0155
## 3 m time 5142 340130 0.0151
## 4 f day 4103 242776 0.0169
## 5 f time 4055 242776 0.0167
frequency_gender <- frequency_gender %>%
select(gender, word, freq) %>%
spread(gender, freq) %>%
arrange(m, f)
frequency_gender
## # A tibble: 18,812 x 3
## word f m
## <chr> <dbl> <dbl>
## 1 abdomen 0.00000412 0.00000294
## 2 abdominal 0.00000412 0.00000294
## 3 abnormalities 0.00000412 0.00000294
## 4 aca 0.00000412 0.00000294
## 5 achy 0.00000412 0.00000294
## 6 adidas 0.00000412 0.00000294
## 7 administered 0.00000412 0.00000294
## 8 afternoonget 0.00000412 0.00000294
## 9 aggregate 0.00000412 0.00000294
## 10 alexis 0.00000412 0.00000294
## # ... with 18,802 more rows
ggplot(frequency_gender, aes(f, m)) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25, color = "#56B4E9") +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
geom_abline(color = "black") +
theme_bw() +
labs(title = "Word Freqencies for Male and Female", x = "Female", y = "Male")
## Warning: Removed 11068 rows containing missing values (geom_point).
## Warning: Removed 11068 rows containing missing values (geom_text).
The figure above shows people in both gender feel good due to victory without difference, and they persue winning something. No doubts it’s a popular value at present. What’s more, males and females are both feel happy due to their partners of high frequency. The words, “husband” and “boyfriend” can delight females significantly. As for difference, females feel good from people around them, such as sister, dad and kids. Jewelry and kindergarten as well help females feel better. For males, beer, mobile, and hill please them a lot.
Then, we move to different countries. Choose the countries of large population and area with different cultures.
frequency_country <- bag_of_words %>%
group_by(country) %>%
count(word, sort = TRUE) %>%
left_join(bag_of_words %>%
group_by(country) %>%
summarise(total = n())) %>%
mutate(freq = n/total)
## Joining, by = "country"
head(frequency_country,5)
## # A tibble: 5 x 5
## # Groups: country [2]
## country word n total freq
## <chr> <chr> <int> <int> <dbl>
## 1 USA friend 6917 417407 0.0166
## 2 USA time 6191 417407 0.0148
## 3 USA day 5812 417407 0.0139
## 4 USA watched 3389 417407 0.00812
## 5 IND day 3167 138456 0.0229
ggplot(frequency_country, aes(x=country, y=freq)) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25, color = "#56B4E9") +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
theme_bw() +
scale_y_log10(labels = percent_format()) +
scale_x_discrete(limits=c("USA", "IND", "CAN", "JPN", "MEX", "PAK", "ITA", "GBR", "IDN", "DEU","FRA")) +
labs(title = "Word Freqencies for Different Countries", x = "Country", y = "Frequency")
## Warning: Removed 11413 rows containing missing values (geom_point).
## Warning: Removed 11413 rows containing missing values (geom_text).
The figure above shows that people in America, Canada, Japan, Indonesia and France really enjoy their life with friends’ company. Englishmen desire success but they also love afternoon tea. Italians need to find a balance between two amazing things, cars and beer. A wonderful life in their mind is driving a car to drink some beer. Indians and Pakistan cherish their family members more. Big house satisfies Pakistan’s dream about perfect life.
frequency_marital <- bag_of_words %>%
group_by(marital) %>%
count(word, sort = TRUE) %>%
left_join(bag_of_words %>%
group_by(marital) %>%
summarise(total = n())) %>%
mutate(freq = n/total)
## Joining, by = "marital"
head(frequency_marital,5)
## # A tibble: 5 x 5
## # Groups: marital [2]
## marital word n total freq
## <chr> <chr> <int> <int> <dbl>
## 1 single friend 6871 315218 0.0218
## 2 single day 5055 315218 0.0160
## 3 single time 4918 315218 0.0156
## 4 married day 4315 267688 0.0161
## 5 married time 4279 267688 0.0160
frequency_marital <- frequency_marital %>%
select(marital, word, freq) %>%
spread(marital, freq) %>%
arrange(single, married)
frequency_marital
## # A tibble: 18,812 x 3
## word married single
## <chr> <dbl> <dbl>
## 1 abdominal 0.00000374 0.00000317
## 2 abig 0.00000374 0.00000317
## 3 abnormalities 0.00000374 0.00000317
## 4 aca 0.00000374 0.00000317
## 5 accent 0.00000374 0.00000317
## 6 acclimated 0.00000374 0.00000317
## 7 achy 0.00000374 0.00000317
## 8 acute 0.00000374 0.00000317
## 9 aerospace 0.00000374 0.00000317
## 10 afternoonget 0.00000374 0.00000317
## # ... with 18,802 more rows
ggplot(frequency_marital, aes(single, married)) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25, color = "#56B4E9") +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
geom_abline(color = "black") +
theme_bw() +
labs(title = "Word Freqencies for Singe and Married People", x = "Single", y = "Married")
## Warning: Removed 11024 rows containing missing values (geom_point).
## Warning: Removed 11024 rows containing missing values (geom_text).
“Home” and “house” make all the people married or not feel warm and happy. A single person can enjoy their “time” with their girlfriend or boyfriend, roommate, fiancee, and even a cat. But married people enjoy more happy moments with their children. It may results from they have became parent. Therefore, next analysis is about parenthood.
frequency_parenthood <- bag_of_words %>%
group_by(parenthood) %>%
count(word, sort = TRUE) %>%
left_join(bag_of_words %>%
group_by(parenthood) %>%
summarise(total = n())) %>%
mutate(freq = n/total)
## Joining, by = "parenthood"
head(frequency_parenthood,5)
## # A tibble: 5 x 5
## # Groups: parenthood [2]
## parenthood word n total freq
## <chr> <chr> <int> <int> <dbl>
## 1 n friend 7312 345196 0.0212
## 2 n day 5457 345196 0.0158
## 3 n time 5399 345196 0.0156
## 4 y day 3913 237710 0.0165
## 5 y time 3798 237710 0.0160
frequency_parenthood <- frequency_parenthood %>%
select(parenthood, word, freq) %>%
spread(parenthood, freq) %>%
arrange(y, n)
frequency_parenthood
## # A tibble: 18,812 x 3
## word n y
## <chr> <dbl> <dbl>
## 1 abdomen 0.00000290 0.00000421
## 2 abdominal 0.00000290 0.00000421
## 3 aca 0.00000290 0.00000421
## 4 accent 0.00000290 0.00000421
## 5 acclimated 0.00000290 0.00000421
## 6 achy 0.00000290 0.00000421
## 7 acute 0.00000290 0.00000421
## 8 administered 0.00000290 0.00000421
## 9 aerospace 0.00000290 0.00000421
## 10 afternoonget 0.00000290 0.00000421
## # ... with 18,802 more rows
ggplot(frequency_parenthood, aes(y,n)) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25, color = "#56B4E9") +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
geom_abline(color = "black") +
theme_bw() +
labs(title = "Word Freqencies for Parenthood or not", x = "Y", y = "N")
## Warning: Removed 11230 rows containing missing values (geom_point).
## Warning: Removed 11230 rows containing missing values (geom_text).
Actually, parents pay more attention to their kids or grandkids, but people without a baby consider their midterm more. It’s reasonable since students who care exams have high probability of not having a kid. Whatever a parent or not, people need tasty food, satisfied job, amazing parties and good car to feel good.
But what will leave in your memory for longer time?
frequency_reflection_period <- bag_of_words %>%
group_by(reflection_period) %>%
count(word, sort = TRUE) %>%
left_join(bag_of_words %>%
group_by(reflection_period) %>%
summarise(total = n())) %>%
mutate(freq = n/total)
## Joining, by = "reflection_period"
head(frequency_reflection_period,5)
## # A tibble: 5 x 5
## # Groups: reflection_period [2]
## reflection_period word n total freq
## <fct> <chr> <int> <int> <dbl>
## 1 months_3 friend 5552 302221 0.0184
## 2 months_3 day 5281 302221 0.0175
## 3 months_3 time 4883 302221 0.0162
## 4 hours_24 friend 4775 280685 0.0170
## 5 hours_24 time 4314 280685 0.0154
frequency_reflection_period <- frequency_reflection_period %>%
select(reflection_period, word, freq) %>%
spread(reflection_period, freq) %>%
arrange(hours_24, months_3)
frequency_reflection_period
## # A tibble: 18,812 x 3
## word hours_24 months_3
## <chr> <dbl> <dbl>
## 1 abdomen 0.00000356 0.00000331
## 2 abdominal 0.00000356 0.00000331
## 3 aca 0.00000356 0.00000331
## 4 achy 0.00000356 0.00000331
## 5 adidas 0.00000356 0.00000331
## 6 afterall 0.00000356 0.00000331
## 7 afternoonget 0.00000356 0.00000331
## 8 agreeable 0.00000356 0.00000331
## 9 ahi 0.00000356 0.00000331
## 10 aircondition 0.00000356 0.00000331
## # ... with 18,802 more rows
ggplot(frequency_reflection_period, aes(hours_24, months_3)) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25, color = "#56B4E9") +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
geom_abline(color = "black") +
theme_bw() +
labs(title = "Word Freqencies for Different Reflection Period", x = "24 Hours", y = "3 Months")
## Warning: Removed 10747 rows containing missing values (geom_point).
## Warning: Removed 10747 rows containing missing values (geom_text).
Seems a good morning, a cute dog, or even a interesting video can make your day. But with time passes, the important date, such as Valentine’s Day, graduation day and birthday impress you the most in three months.
Will things become different in varous ages?
frequency_age <- bag_of_words %>%
group_by(age) %>%
count(word, sort = TRUE) %>%
left_join(bag_of_words %>%
group_by(age) %>%
summarise(total = n())) %>%
mutate(freq = n/total)
## Joining, by = "age"
# Remove all the non-numeric characters in age column and convert age to numeric
pattern <- "[0-9]*"
frequency_age <- frequency_age[grepl("[0-9]*",frequency_age$age),]
frequency_age <- frequency_age[!grepl("prefer not to say",frequency_age$age),]
frequency_age <- frequency_age[!grepl("[a-z]^",frequency_age$age),]
frequency_age$age <- as.integer(frequency_age$age)
## Warning: NAs introduced by coercion
frequency_age <-na.omit(frequency_age)
frequency_age <- frequency_age[frequency_age$age <= 95,]
max(frequency_age$age)
## [1] 95
nrow(frequency_age)
## [1] 147003
ggplot(frequency_age, aes(x = age, y = freq)) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25, color = "#56B4E9") +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_continuous(limits = c(0,100)) +
theme_bw() +
scale_y_log10(labels = percent_format()) +
labs(title = "Word Freqencies for Different Ages", x = "Age", y = "Frequency")
When you are a child, you may feel good because birthday gift is a new bike. But when you grow up, friends support you and you start to pursue sucess in your career. Then getting older makes you want to spend more time with family. When you are around 80 years old, planting become attractive.
frequency_age$age_range <- floor(frequency_age$age/10)*10
frequency_age %>%
group_by(age_range, word) %>%
summarise(freq = sum(freq)) %>%
left_join(frequency_age %>%
group_by(age_range, word))
## Joining, by = c("age_range", "word", "freq")
## # A tibble: 40,510 x 6
## # Groups: age_range [?]
## age_range word freq age n total
## <dbl> <chr> <dbl> <int> <int> <int>
## 1 0 actress 0.0217 2 1 46
## 2 0 aircondition 0.00461 3 1 217
## 3 0 appreciated 0.0217 2 1 46
## 4 0 backyard 0.0217 2 1 46
## 5 0 bag 0.00461 3 1 217
## 6 0 bar 0.00461 3 1 217
## 7 0 beach 0.00461 3 1 217
## 8 0 beer 0.0217 2 1 46
## 9 0 bike 0.00922 3 2 217
## 10 0 biriyani 0.00461 3 1 217
## # ... with 40,500 more rows
frequency_age %>%
group_by(age_range) %>%
arrange(desc(freq)) %>%
slice(1:5) %>%
ggplot(aes(word, freq, fill = freq)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ age_range, scales = "free") +
ylab("Frequency") +
coord_flip()
When investigate deeper, some details attract me. Students concern passing the tests and receive an offer of admission from dream school. Granddaughter could be really sweet to grandparents in their sixties.
At first, get the overall average sentiment score according to the ground truth category. No doubts the happy moments contain almost positive elements in every category.
words_by_category <- bag_of_words %>%
count(ground_truth_category, word, sort = TRUE) %>%
ungroup()
words_by_category
## # A tibble: 31,191 x 3
## ground_truth_category word n
## <chr> <chr> <int>
## 1 <NA> friend 8679
## 2 <NA> day 7988
## 3 <NA> time 7795
## 4 <NA> family 3779
## 5 <NA> watched 3534
## 6 <NA> home 3418
## 7 <NA> played 3303
## 8 <NA> feel 3259
## 9 <NA> finally 3151
## 10 <NA> found 3031
## # ... with 31,181 more rows
word_sentiments <- words_by_category %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
group_by(ground_truth_category) %>%
summarize(score = sum(score * n) / sum(n))
word_sentiments %>%
mutate(ground_truth_category = reorder(ground_truth_category, score)) %>%
ggplot(aes(ground_truth_category, score, fill = score > 0)) +
geom_col(show.legend = FALSE) +
coord_flip() +
ylab("Average sentiment score")
Then, sentiment analysis by word shows the majority of words express positive feelings and only two words “lost” and “bad” are negative. Guess the happy moment happened after the “lost” and “bad” moment.
contributions <- bag_of_words %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
group_by(word) %>%
summarize(occurences = n(),
contribution = sum(score))
contributions
## # A tibble: 820 x 3
## word occurences contribution
## <chr> <int> <int>
## 1 abandoned 19 -38
## 2 ability 62 124
## 3 aboard 10 10
## 4 absentee 1 -1
## 5 absorbed 6 6
## 6 abusive 16 -48
## 7 accepted 562 562
## 8 accident 103 -206
## 9 accidentally 37 -74
## 10 accomplished 309 618
## # ... with 810 more rows
contributions %>%
top_n(25, abs(contribution)) %>%
mutate(word = reorder(word, contribution)) %>%
ggplot(aes(word, contribution, fill = contribution > 0)) +
geom_col(show.legend = FALSE) +
coord_flip()
Now, try sentiment analysis by word in different ground truth categories. The negative word appears in achievement category, and the guess above may be right. First you lose and then you get something makes you happy.
top_sentiment_words <- words_by_category %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
mutate(contribution = score * n / sum(n))
top_sentiment_words <- na.omit(top_sentiment_words)
top_sentiment_words %>%
group_by(ground_truth_category) %>%
top_n(5, abs(contribution)) %>%
ungroup() %>%
mutate(word = reorder(word, contribution)) %>%
ggplot(aes(word, contribution, fill = contribution > 0)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ ground_truth_category, scales = "free", ncol = 3) +
coord_flip()
Try to find some correlation between different category and the happy moment. Find tf-idf within different ground truth categories.
tf_idf <- words_by_category %>%
bind_tf_idf(word, ground_truth_category, n) %>%
arrange(desc(tf_idf))
tf_idf
## # A tibble: 31,191 x 6
## ground_truth_category word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 exercise exercise 32 0.0345 0.560 0.0193
## 2 nature blooming 11 0.00778 1.25 0.00975
## 3 exercise gym 53 0.0571 0.154 0.00880
## 4 affection daughter 419 0.0136 0.560 0.00761
## 5 leisure movie 228 0.0441 0.154 0.00679
## 6 leisure wend 28 0.00541 1.25 0.00678
## 7 exercise yoga 10 0.0108 0.560 0.00603
## 8 exercise workout 33 0.0356 0.154 0.00548
## 9 exercise jog 6 0.00647 0.847 0.00548
## 10 exercise weight 9 0.00970 0.560 0.00543
## # ... with 31,181 more rows
The 10 terms with the highest tf-idf within each of the ground truth category.
tf_idf %>%
group_by(ground_truth_category) %>%
top_n(10, tf_idf) %>%
ungroup() %>%
mutate(word = reorder(word, tf_idf)) %>%
ggplot(aes(word, tf_idf, fill = ground_truth_category)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ ground_truth_category, scales = "free") +
ylab("tf-idf") +
coord_flip()
category_cors <- words_by_category %>%
pairwise_cor(ground_truth_category, word, n, sort = TRUE)
## Warning: Trying to compute distinct() for variables not found in the data:
## - `row_col`, `column_col`
## This is an error, but only a warning is raised for compatibility reasons.
## The operation will return the input unchanged.
category_cors <- na.omit(category_cors)
category_cors
## # A tibble: 42 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 enjoy_the_moment affection 0.612
## 2 affection enjoy_the_moment 0.612
## 3 enjoy_the_moment achievement 0.575
## 4 achievement enjoy_the_moment 0.575
## 5 achievement affection 0.463
## 6 affection achievement 0.463
## 7 enjoy_the_moment leisure 0.434
## 8 leisure enjoy_the_moment 0.434
## 9 nature enjoy_the_moment 0.394
## 10 enjoy_the_moment nature 0.394
## # ... with 32 more rows
category_cors %>%
filter(correlation > .1) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(alpha = correlation, width = correlation)) +
geom_node_point(size = 6, color = "lightblue") +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void()
Seems every category has relationship with each other. Maybe because the they both use the similar happy words to describe happy moment.
usenet_bigrams <- hm_data %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
usenet_bigram_counts <- usenet_bigrams %>%
count(ground_truth_category, bigram, sort = TRUE) %>%
ungroup() %>%
separate(bigram, c("word1", "word2"), sep = " ")
usenet_bigram_counts <- na.omit(usenet_bigram_counts[order(usenet_bigram_counts$n),])
# Pick top ten first word with highest frequency in the bigram data
head(sort(table(usenet_bigram_counts$word1), decreasing = T),10)
##
## day time friend found finally played watched home son
## 566 510 501 321 315 289 268 264 262
## feel
## 257
Select the top ten first word with highest frequency in the bigram data to find the phrases.
happy_words <- c("day", "time","friend","finally", "found", "watched", "played","daughter", "son", "home")
usenet_bigram_counts %>%
filter(word1 %in% happy_words) %>%
count(word1, word2, wt = n, sort = TRUE) %>%
inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>%
mutate(contribution = score * nn) %>%
group_by(word1) %>%
top_n(10, abs(contribution)) %>%
ungroup() %>%
mutate(word2 = reorder(paste(word2, word1, sep = "__"), contribution)) %>%
ggplot(aes(word2, contribution, fill = contribution > 0)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ word1, scales = "free", nrow = 3) +
scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) +
xlab("Words preceded by a happy word") +
ylab("Sentiment score * # of occurrences") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
coord_flip()
When you finally finish something, you feel really good. You will be glad to hear the good news from your kids. Three things also make your day: stay at home, have fun with friends and watch video.
If we ignore the ground truth category and find some new classifications.
# include only words that occur at least 50 times
topic_word <- bag_of_words %>%
group_by(word) %>%
mutate(word_total = n()) %>%
ungroup() %>%
filter(word_total > 50)
# convert into a document-term matrix
# with document names such as sci.crypt_14147
topic_dtm <- topic_word %>%
unite(document, ground_truth_category, wid) %>%
count(document, word) %>%
cast_dtm(document, word, n)
## Warning: Trying to compute distinct() for variables not found in the data:
## - `row_col`, `column_col`
## This is an error, but only a warning is raised for compatibility reasons.
## The operation will return the input unchanged.
library(topicmodels)
topic_lda <- LDA(topic_dtm, k = 8, control = list(seed = 2018))
topic_lda %>%
tidy() %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_y") +
coord_flip()
Divide the sample into eight types, and we can create some labels for them. Type 1 could be someone who really enjoys family life and expects surprise. Type 2 could be a primary school student. Type 3 could be professional women in their twenties. Type 4 could be beautiful girls who like shopping and trip. Type 5 could be college boy who is dating with someone. Type 6 could be married guys who have one or more kids. Type 7 could be a positive person who love the world. Type 8 could be a quiet guy who raises a dog or cat and loves reading.
Interesting analysis about happy moment ends here.